This lecture focuses on theoretical variogram functions and how to fit them to experimental data.
It is possible to work with spatial covariance functions in geostatistics, as we did in the last lecture. In this lecture, we want to focus on the most common tool, the variogram, which is closely related to the covariance function we built in the last lecture.
The variogram comes with a few advantages, which we will see below.
In the last lecture, we implemented each and every calculation ourselves. This was important to understand what is going on. In this lecture, we will use third-party Python modules to save some time. New content will again be introduced with examples of how to implement the presented algorithms, so that they can be implemented directly in languages other than Python.
First, we need to go back to the covariance function we used before.
For a set of observations $x$ and another set of observations $x_h$ at a spatial lag $h$, their covariance was defined as:

$$ C(h) = \frac{1}{N} \sum_{i=1}^N (x_i - \mu)(x_{h,i} - \mu_h) $$

where $N$ is the number of point pairs and $\mu$, $\mu_h$ are the respective expected values of $x$ and $x_h$. That means that when calculating the covariance at a specific lag, the two $\mu$ are constant for all elements in the sum. We can expand the parentheses in the equation above to:

$$ C(h) = \frac{1}{N} \sum_{i=1}^N x_i \cdot x_{h,i} - \mu \cdot \mu_h $$
If we want the covariance as defined above to be a good estimator for the population, we have to make sure that the three expected values are actually a good measure for the sample's distribution. That is the case for a normal distribution, where all expected values are just arithmetic means. Thus, we require that:

$$ E[x] = E[x_h] = \mu \qquad \text{and} \qquad Var(x) = Var(x_h) = \sigma^2 $$
This is called second-order stationarity, as the mean and the variance of the variable are required to be stationary.
But we run into more assumptions and limitations here. Think of the covariance function at lag 0:

$$ C(0) = Cov(x, x) := Var(x) $$

Thus, we require the empirical covariance function to match the variance at lag 0, which is often difficult to achieve due to the limited number of point pairs at very close distances.
Further, we are seeking a special lag $r$ at which samples become statistically independent. Theoretical covariance model functions will not map values $Cov < 0$, but empirical values might well be negative. Thus, we need a tool that can, in principle, model unbounded, increasing dissimilarity with distance.
That function is called a semi-variogram. It does not use the covariance, but the semi-variance, and in the case of stationarity it is defined as:

$$ \gamma(h) = \sigma^2 - C(h) $$

The semi-variance can be calculated as:

$$ \gamma(h) = \frac{1}{2N} \sum_{i=1}^N (x_i - x_{h,i})^2 $$

The semi-variance has a few advantages over using the covariance function:
- It is always 0 or larger, which fits better to real-world data that is often prone to measurement uncertainty or small-scale variations.

One of the Python modules for geostatistics is scikit-gstat. To actually calculate a variogram, we will use this module to obtain all the data we used in the last lecture with less effort.
For exercises, you can of course also reuse the code from the last lecture.
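Before relying on scikit-gstat, the semi-variance formula above can be sketched in a few lines of plain Python. The arrays here are toy values, purely illustrative:

```python
import numpy as np

def semivariance(x, x_h):
    # gamma(h) = 1 / (2N) * sum_i (x_i - x_{h,i})^2
    x, x_h = np.asarray(x, dtype=float), np.asarray(x_h, dtype=float)
    return np.sum((x - x_h)**2) / (2 * len(x))

# toy point pairs: squared differences are 1, 0 and 4  ->  5 / (2 * 3)
print(semivariance([1., 2., 3.], [2., 2., 5.]))  # → 0.8333333333333334
```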
The main advantage of using the Variogram class in scikit-gstat is that it makes all the intermediate calculation steps available to us.
import skgstat as skg

# using the same settings as in the last lecture
V = skg.Variogram(coords, sample.z, maxlag='median', n_lags=6, normalize=False)
The 1D distance array is available as V.distance, the 2D distance matrix as V.distance_matrix. The bin edges are available as V.bins and the grouping array (1D version) as V.lag_groups.
print(V.bins.round(2))
print(V.distance[:10].round(2))
print(V.lag_groups()[:10])
[ 8.64 17.29 25.93 34.57 43.21 51.86]
[19.65 12.21 15.81 66.04 74.46 22.2  62.1  77.52 35.   30.41]
[ 2  1  1 -1 -1  2 -1 -1  4  3]
The values are also available, as well as the pairwise differences. The class again makes 1D and 2D versions available. But there is a small inconsistency in the class: the actual observations are available as V.values, while the pairwise differences are at V._diff as 1D and V.value_matrix as 2D.
print(V.value_matrix[:10, :10].round(2))
print('----------------------------')
print(V._diff[:10].round(2))
[[0.   0.04 0.49 0.35 1.49 0.71 0.78 0.71 2.37 0.97]
 [0.04 0.   0.53 0.39 1.45 0.75 0.74 0.75 2.41 0.93]
 [0.49 0.53 0.   0.14 1.98 0.22 1.27 0.22 1.87 1.46]
 [0.35 0.39 0.14 0.   1.84 0.36 1.13 0.36 2.02 1.32]
 [1.49 1.45 1.98 1.84 0.   2.2  0.71 2.2  3.86 0.52]
 [0.71 0.75 0.22 0.36 2.2  0.   1.49 0.   1.66 1.68]
 [0.78 0.74 1.27 1.13 0.71 1.49 0.   1.49 3.14 0.19]
 [0.71 0.75 0.22 0.36 2.2  0.   1.49 0.   1.66 1.68]
 [2.37 2.41 1.87 2.02 3.86 1.66 3.14 1.66 0.   3.34]
 [0.97 0.93 1.46 1.32 0.52 1.68 0.19 1.68 3.34 0.  ]]
----------------------------
[0.04 0.49 0.35 1.49 0.71 0.78 0.71 2.37 0.97 0.05]
Thus, we are now all set to calculate the semi-variances:
k = 6
vario = []
for _k in range(k):
    squared_differences = []
    for idx, group in enumerate(V.lag_groups()):
        if group == _k:  # right bin
            squared_differences.append(V._diff[idx]**2)
    # gamma is the sum divided by 2 * length
    vario.append(sum(squared_differences) / (2 * len(squared_differences)))
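The same computation can also be written in vectorized form with NumPy. The random arrays below are stand-ins for V._diff and V.lag_groups(), just so the sketch is self-contained:

```python
import numpy as np

rng = np.random.default_rng(42)
diff = rng.random(200)                   # stand-in for V._diff
groups = rng.integers(-1, 6, size=200)   # stand-in for V.lag_groups()

# one semi-variance per lag class, using a boolean mask instead of the inner loop
vario = [
    (diff[groups == _k]**2).sum() / (2 * (groups == _k).sum())
    for _k in range(6)
]
```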
show(variogram)
So far, the maximum lag was set to the median separating distance, which is roughly at 50 meters. The samples were taken from a 100x100 area, so we can go to 100 as a meaningful maximum lag distance setting. To resolve the larger distances, we should also increase the number of bins.
For this to happen, we would need to go back and re-define the bins, then re-index the lag class grouping and iterate over all groups to collect all pairwise differences in the new lag bins. Then, vario could be re-calculated.

Or, we let the scikit-gstat Variogram class demonstrate when object-oriented programming is a good idea:
V.maxlag = 100
V.n_lags = 8
That's it.
The resulting empirical, or experimental, variogram can be accessed from a property of same name:
print(V.experimental.round(2))
[0.07 0.26 0.53 0.92 0.95 0.79 0.71 0.76]
show(variogram)
This way, you can check hundreds of fine-tuned values with ease. But it is necessary to understand what is happening under the hood.
Empirical variograms already tell us a lot about spatial structure and an apparent spatial dependency in the data set. Now, we want to describe it more systematically through a mathematical function.
That has a lot of advantages.

A very helpful diagnostic tool, once the parameters above have been calculated, is to look at the nugget-to-sill ratio.
The nugget is the share of the total variance that you will not be able to model spatially, no matter what. Therefore, this ratio can give you an idea of how much variance in the sample the model can explain. Keep in mind that this is a statistical model of the sample, not a feature of the population.
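As a quick sketch (the helper name is mine, not part of any library), the ratio is simply the nugget's share of the total sill:

```python
def nugget_to_sill_ratio(nugget, sill):
    # share of the total variance that cannot be modeled spatially
    return nugget / (nugget + sill)

# e.g. nugget b = 2 and partial sill C0 = 13 -> roughly 13% unexplained
print(round(nugget_to_sill_ratio(2.0, 13.0), 3))  # → 0.133
```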
The most used geostatistical model is the so-called spherical model. It is defined as:
$$ \gamma(h) = b + C_0 * \left(\frac{3}{2}*\frac{h}{a} - \frac{1}{2}*\left(\frac{h}{a}\right)^3 \right) \quad \text{if } h \leq a $$

and

$$ \gamma(h) = b + C_0 \quad \text{otherwise.} $$

Here, $C_0$ is the sill, $b$ is the nugget and $a$ is the variogram range parameter, which is not in general the effective range $r$. For the case of a spherical model, $a := r$.
def spherical(h, r, C0, b):
    # for the spherical model, the range parameter equals the effective range
    a = 1 * r
    if h <= a:
        return b + C0 * (1.5 * (h / a) - 0.5 * (h / a)**3)
    else:
        return b + C0
x = list(range(100))
y = [spherical(_, 45, 13, 2) for _ in x]
show(models)
The exponential model is also quite common in geoscience. It is very similar to the spherical model, but increases much more sharply at short lags and flattens out at larger lags.
$$ \gamma(h) = b + C_0 * \left(1 - \exp\left(-\frac{h}{a} \right) \right) $$

with the relation between effective range $r$ and range parameter $a$:

$$ a = \frac{r}{3} $$

import numpy as np

def exponential(h, r, C0, b):
    a = r / 3.
    return b + C0 * (1 - np.exp(-h / a))
y2 = [exponential(_, 45, 13, 2) for _ in x]
show(models)
The last of the three most important theoretical models is the Gaussian model. It can model variables that are quite similar up to a specific lag and then drastically increase in dissimilarity.

$$ \gamma(h) = b + C_0 * \left(1 - \exp\left(-\frac{h^2}{a^2} \right) \right) $$

and

$$ a = \frac{r}{2} $$
In the literature, you will also find $a = \frac{r}{\pi}$ and $a = \frac{r}{3}$.
def gaussian(h, r, C0, b):
    a = r / 2.
    return b + C0 * (1 - np.exp(-h**2 / a**2))
y3 = [gaussian(_, 45, 13, 2) for _ in x]
show(models)
One of the most important steps in geostatistics is fitting the theoretical model to the empirical data from the sample. If your model does not represent your data well, you will run into issues during interpolation. No matter how sophisticated or performant the algorithm is, it will fail.
The easiest, but in my opinion also most effective way to fit a model to data is to choose the parameters manually. Let's recall the experimental variogram:
show(variogram)
I would set nugget=0 and the sill to around sill=0.76, and choose a somewhat longer effective range of range=70. That will fit the first few bins better at the cost of the more distant lags.
r = 70
C0 = 0.76
b = 0
y1 = [spherical(_, r, C0, b) for _ in x]
y2 = [exponential(_, r, C0, b) for _ in x]
y3 = [gaussian(_, r, C0, b) for _ in x]
show(variogram)
Obviously, the Gaussian model describes the first few bins very well, if we accept that the remaining bins are fitted less precisely. The spherical model is more or less OK; the exponential is quite off and does not describe the data well.
The field of automatic fitting is immense, and it is by no means bound to geostatistics. What we are actually trying to achieve is a regression: we want to fit a function to an independent variable to predict the dependent variable as well as possible. There are hundreds of methods out there, both very specialized and very generalized ones.
We will implement a brute-force version of a least squares algorithm, because it is easy to implement and easy to understand, while still being fairly performant for our use case.
The basic idea is to find the optimal parameter set from a given parameter space, to predict the dependent variable with as little error as possible.
Or in other words: we need an objective function that gives us the error of the model when the actual value is known.
I like the term loss function as well, which is more common in the machine learning world.
Now, with that function the optimization problem can be solved by minimizing the loss function. We use the one combination of range and sill, that yields the smallest root mean squared error, or RMSE.
$$ RMSE = \sqrt{\frac{\sum_{i=1}^N (y_i^* - y_i)^2}{N}} $$

where $y$ is the observation and $y^*$ the prediction for the same value of $x$.
def loss(func, bins, gamma, _range, _sill):
    # predicted semi-variance at each bin, nugget fixed to 0
    mod = [func(h, _range, _sill, 0) for h in bins]
    # squared deviations between model and experimental variogram
    dev = [(m - o)**2 for m, o in zip(mod, gamma)]
    return np.sqrt(sum(dev) / len(dev))
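A quick way to convince yourself that the loss behaves: if the "observed" variogram is generated by the model itself, the RMSE must be exactly zero. The definitions from above are repeated here so the check runs on its own:

```python
import numpy as np

def spherical(h, r, C0, b):
    # spherical model from above, repeated for a self-contained check
    a = 1 * r
    if h <= a:
        return b + C0 * (1.5 * (h / a) - 0.5 * (h / a)**3)
    return b + C0

def loss(func, bins, gamma, _range, _sill):
    mod = [func(h, _range, _sill, 0) for h in bins]
    dev = [(m - o)**2 for m, o in zip(mod, gamma)]
    return np.sqrt(sum(dev) / len(dev))

# hypothetical bins; gamma is produced by the model, so the loss must vanish
bins = [10, 25, 40, 60]
gamma = [spherical(h, 45, 13, 0) for h in bins]
print(loss(spherical, bins, gamma, 45, 13))  # → 0.0
```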
There are a few things to consider:
- the model function, which is spherical, exponential or gaussian in our case
- the observations, which would be x and y in a general least squares implementation
- we fixed nugget=0; the least squares will of course work with any number of parameters (in theory)
The only missing thing now are the parameters to plug in. Which values should we use? This is where brute force sets in: we simply try all values. In most cases, that is not really a clever idea. But in the case of variogram fitting, we can dramatically limit the parameter space:
- Neither parameter can be < 0, as $\gamma(h) \geq 0$ by definition.
- The range cannot exceed the maximum lag distance (100 in our case). This is the natural upper limit for the range.
- The sill should not exceed about 2 x Var. Much more would just be an effect of small sample sizes in the respective bins.

Finally, the step between two parameters can be limited as well. If we bin distances e.g. like:
V.bins.round(1)
array([ 12.5, 25. , 37.5, 50. , 62.5, 75. , 87.5, 100. ])
Why would we test the range parameters 16.01, 16.02, 16.03 and so on? We can easily split up both parameter ranges into 100 chunks:
# 0 <= range <= 100
r_test = list(range(100))

# 0 <= sill <= 2 * sample variance
s_test = [(_ / 100) * 2 * sample.z.var() for _ in range(100)]

# parameter 'mesh'-grid
params = []
for _r in r_test:
    for _s in s_test:
        params.append((_r, _s, ))

print('Testing %d parameter combinations' % len(params))
Testing 10000 parameter combinations
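As an aside, the nested grid-building loop above is equivalent to itertools.product. A fixed number stands in for 2 * sample.z.var() here, just so the snippet runs standalone:

```python
from itertools import product

r_test = list(range(100))
# 0.85 is a hypothetical stand-in for 2 * sample.z.var()
s_test = [(_ / 100) * 0.85 for _ in range(100)]

params = list(product(r_test, s_test))
print('Testing %d parameter combinations' % len(params))  # → Testing 10000 parameter combinations
```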
10000 combinations is something that our computer can easily handle. Time to brute force!
from time import time

result = []
t1 = time()
for _range in r_test:
    r_result = []
    for _sill in s_test:
        r_result.append(loss(gaussian, V.bins, V.experimental, _range, _sill))
    result.append(r_result)
t2 = time()

print('Fitting took %.3f seconds' % (t2 - t1))
Fitting took 1.576 seconds
show(optim)
We can see a few things here:

- To minimize, we need the smallest RMSE, represented by the green colors.
- The small sill values yield such bad results that the range didn't matter. That's expected.
- The effect of the range is only visible for a specific range of sill values.
- Our manual fit with range=70 and sill=0.76 was already close to optimal.
Next, we can extract the parameter set with the smallest RMSE:
# we store the smallest rmse and its index
low_index = (0, 0)
low_rmse = result[0][0]
for i in range(len(r_test)):
    for j in range(len(s_test)):
        if result[i][j] < low_rmse:
            low_index = (i, j)
            low_rmse = result[i][j]

# extract parameters
opt_range = r_test[low_index[0]]
opt_sill = s_test[low_index[1]]
print('Best Parameter set: range=%.0f sill=%.2f [RMSE: %.3f]' % (opt_range, opt_sill, low_rmse))
Best Parameter set: range=66 sill=0.82 [RMSE: 0.106]
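The scanning loop above can be condensed with NumPy's argmin and unravel_index. A random matrix stands in for the real result grid here, since that depends on the fit above:

```python
import numpy as np

rng = np.random.default_rng(0)
result = rng.random((100, 100))  # stand-in for the RMSE grid computed above

# flat index of the global minimum, converted back to a 2D (row, col) index
i, j = np.unravel_index(np.argmin(result), result.shape)
low_rmse = result[i, j]
```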
Apply the Gaussian model, like we did before:
y_opt = [gaussian(_, opt_range, opt_sill, 0) for _ in x]
show(fit_compare)
If we consider the best 5% of all tested parameter sets to be valid solutions to the minimization problem, we can plot all of their variograms into the figure as well, to get a better idea of the spread.
show(fit_compare)
We can implement this into the least squares as well. To illustrate the procedure, we will simply give the first three bins triple weight.
def loss_adjust(func, bins, gamma, _range, _sill):
    mod = [func(h, _range, _sill, 0) for h in bins]
    dev = [(m - o)**2 for m, o in zip(mod, gamma)]
    # give the first three bins triple weight
    dev[:3] = [_ * 3. for _ in dev[:3]]
    return np.sqrt(sum(dev) / len(dev))
result_adj = []
t1 = time()
for _range in r_test:
    r_result = []
    for _sill in s_test:
        r_result.append(loss_adjust(gaussian, V.bins, V.experimental, _range, _sill))
    result_adj.append(r_result)
t2 = time()

print('Fitting took %.3f seconds' % (t2 - t1))
Fitting took 1.529 seconds
show(row(optim, optim_adj))
While the RMSE drops slightly, which is expected, the green area narrows down. This makes it easier to select parameter sets from the parameter space.
show(column(fit_adj, fit_compare))
We can see a few things here: